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Abstract 

Background: Gene expression is controlled by proximal promoters and distal regulatory elements such as enhancers. 
While the activity of some promoters can be invariant across tissues, enhancers tend to be highly tissue-specific. 

Results: We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell 
types. Putative transcription factor binding sites within each set of sequences were used to train a support vector machine 
classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 
92% of the tissues, with an area under the receiver operating characteristic curve between 60% (for subthalamic nucleus 
promoters) and 98% (for heart promoters). We next used these classifiers to identify tissue-specific enhancers, scanning 
distal non-coding sequences in the loci of the 200 most highly and lowly expressed genes. Thirty percent of reliable 
classifiers produced consistent enhancer predictions, with significantly higher densities in the loci of the most highly 
expressed compared to lowly expressed genes. Liver enhancer predictions were assessed in vivo using the 
hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions yielded significant enhancer activity 
in the mouse liver, whereas a control set of five sequences was completely negative. 

Conclusions: We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific 
signatures that can be learned and used for the de novo prediction of enhancers. 



Background 

A fundamental question in biology is how cells and tissues 
differentiate and maintain their identity from essentially 
the same genome. Wide variation in spatial, temporal 
and condition-dependent expression patterns of more 
than 20,000 genes in the human genome [1] is required 
for the establishment and maintenance of different cell 
fates and environmental responses. Tissue-specific genes 
are often implicated in distinct developmental and meta- 
bolic pathways and therefore may constitute good candi- 
dates for biomarkers or drug targets. 

The control of gene transcription is mediated by tran- 
scription factors (TFs), which interact in a sequence- 
specific manner with DNA motifs, known as TF binding 
sites. The promoter is frequently divided into a basal 



* Correspondence: nadav.ahituv@ucsf.edu; ovchareniSnih.gov 
^Department of Bioengineering and Therapeutic Sciences, University of 
California San Francisco, San Francisco, CA 94158, USA 
'Computational Biology Branch, National Center for Biotechnology 
Information, National Library of Medicine, National Institutes of Health, 
Bethesda, MD 20894, USA 

Full list of author information is available at the end of the article 



core, covering approximately 100 bp upstream of the 
transcription start site (TSS), and a proximal promoter, 
which extends up to a few hundred base pairs and typic- 
ally contains multiple TF binding sites [2,3]. In addition 
to promoters, other cM-regulatory sequences, such as en- 
hancers, are specifically bound by TFs and are central 
players in the control of transcription in multicellular 
eukaryotes. The regulation of promoters by distal enhancers 
involves DNA looping or scanning and/or higher-order 
conformation changes in chromatin [4-6], resulting in 
an increase in the local concentration of TFs in the 
vicinity of a promoter and the initiation or enhancement 
of transcription. It has been long recognized that prox- 
imal promoters and enhancers are functionally similar, 
and virtually undistinguishable from each other (see, for 
example, [7,8]). 

Both enhancers and promoters have been shown to 
contain DNA motifs for specific TFs, depending on their 
tissue-specific activities (for example, [8-10]). In par- 
ticular, CpG -depleted promoters are enriched with DNA 
motifs [11], suggesting a distinct regulatory mechanism 
from CpG-rich promoters. The transcription complex 
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LDBl, which involves GATAl, GATA2, TALI, LM02, 
and RUNXl, and has been extensively studied in the 
context of the differentiation of erythroid cells, illuminates 
this distinction. Whereas LDBl binds mostly within CpG- 
depleted promoters, it only binds downstream of CpG- 
rich promoters, often within the first intron of their target 
gene [12]. Additional evidence suggests that such DNA 
motifs representing putative TF binding sites are pre- 
dictive of promoter activity, including tissue-specific 
expression of their target gene (for example, [13,14]). In 
addition, DNA motif enrichment analyses have shown 
that DNA motifs are highly predictive of enhancer 
activity [15-18]. 

Unlike promoters, enhancers can act over very long 
distances. Based on the relative location of conserved 
non-coding elements (CNEs) in the human genome, early 
estimates suggested that a large number of enhancers 
are more than 250 kilobases (kb) away from their target 
gene [19]. For example, a conserved enhancer of Shh 
that is associated with Polydactyly is located 1 megabase 
(Mb) upstream of Shh, within an intron of another gene 
[20]. Furthermore, similar approaches have determined 
that the regulatory elements controlling the transcrip- 
tion of SOX9 are scattered over 1 Mb upstream of its 
TSS [21,22]. More recently, genome-wide chromatin 
interaction analyses have confirmed that such long- 
range interactions are indeed widespread, providing 
evidence that the vast majority of enhancers target genes 
other than their nearest genes [23,24]. Because of their 
genomic distribution and poorly characterized sequence 
features, enhancers have been difficult to identify. Only 
the advent of high-throughput sequencing technologies 
has led to large-scale screens for regulatory sequences 
that are now starting to reveal complete regulatory 
networks and signal transduction pathways in higher 
eukaryotes [25]. Such screens, however, represent a snap- 
shot of a single cell type and set of conditions, and conclu- 
sions cannot, therefore, be easily generalized. 

Previous studies have focused on identifying sequence 
features in either promoters or enhancers, and con- 
structing models that describe these genomic elements, 
individually. Here, we show how the presence and/or 
absence of motifs in the promoter regions of genes with 
tissue-specific expression profiles can be used to reliably 
identify distal enhancers with analogous tissue-specific 
activity. Predicted enhancers are highly enriched in the 
loci of concordantly expressed genes (for instance, in 
the case of predicted liver enhancers, they are five-fold 
more abundant in the loci of most highly expressed liver 
genes than in the loci of lowly expressed liver genes), 
and overlap significantly with chromatin signatures pre- 
dictive of enhancer activity. Experimental validation in 
mice supports the high accuracy of the presented method 
in predicting tissue-specific enhancers. With the advent of 



new technologies and the resulting deluge of expression 
data, approaches exploiting sequence features shared be- 
tween promoters and enhancers hold great promise to 
understanding the cw-regulatory code encrypted in the 
genomes of higher organisms. 

Results and discussion 

Promoters of tissue-specific genes contain tissue-specific 
signatures 

Although most promoters drive basal levels of transcrip- 
tion ubiquitously, some promoters are capable of control- 
ling transcription in a tissue- and/or temporal-specific 
manner [26-29]. Genes controlled by these types of pro- 
moters are expressed in specific tissues and develop- 
mental stages, and may be induced by endogenous or 
exogenous factors. Here, we set out to systematically 
test whether promoters of genes that exhibit a particular 
expression profile in a given tissue contain sequence sig- 
natures that confer tissue-specificity and separate them 
from ubiquitous promoters. For this purpose, we collected 
the promoter regions (from 25 kb upstream to 0.5 kb 
downstream of the TSS; see Materials and methods) of 
the top 200 highly expressed genes in 79 different tis- 
sues and cell types [30] (see Materials and methods). As 
negative controls, we selected the promoters of the 
200 least expressed genes in the same set of tissues. 
Although expression breadth was not considered in the 
construction of such gene sets, most of the genes in the 
sets are only expressed at high (or low) levels in the tissue 
of interest. Even if some genes are expressed across several 
tissues at high or low levels, the sets are highly non- 
overlapping (Supplementary notes in Additional file 1). 
Thus, while the individual genes in a given set are not 
strictly tissue-specific, the set itself is. 

To assess the role of promoters in determining tissue- 
specific expression, we trained a support vector machine 
(SVM) for each of the 79 tissues. More precisely, to dis- 
criminate between promoters of most highly expressed 
and inhibited genes in each tissue, the classifiers relied 
on in silico occurrences of TF binding sites within their 
evolutionary conserved regions (see Materials and methods; 
Supplementary notes in Additional file 1). We evaluated 
the models' ability to accurately predict expression using 
the area under the receiver operating characteristic curve 
(AUG) in a five-fold cross-validation framework. Most 
models (73/79) can reliably distinguish promoters of genes 
most highly expressed in a given tissue from those of lowly 
expressed genes by identifying TFs associated with these 
tissues with median AUG values between 0.60 and 1.00 
(Figure lA). To be specific, for half of the models (39/79), 
we obtained median AUG values higher than 0.8. Moreover, 
when we tested a model trained on a particular tissue 
on the promoters of genes expressed in another tissue, 
we obtained relatively high AUG values mainly for 
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Figure 1 DNA motifs in human promoters predict tissue-specific expression. (A) Area under the receiver operating characteristic (ROC) 
curve for 79 models trained and tested on promoters of genes highly expressed in 79 different tissues. The AUC is an overall summary of diagnostic 
accuracy. AUC equals 0.5 when the ROC curve corresponds to random chance and 1 .0 for perfect accuracy. Reliable models (with median AUC >0.6) 
are displayed in red, while unreliable models (with median AUC <0.6) are displayed in gray. Models were evaluated in a five-fold cross-validation 
setting. (B) Motifs with the greatest predictive power for the liver model. The weights w of the motifs (see Materials and methods) are given in 
red. Motif weights have been scaled to [-1, 1], where 1 represents the scaled weight of the motif with highest predictive power, and -1 the scaled 
weight of the motif with the lowest negative predictive power (signs are preserved; see Materials and methods). The names of the features are listed 
near the baseline of the graph. For comparison, we include weights w for the same motif in the lung, caudate nucleus, thymus models (in different 
shades of gray). Similarities among the genes that were used to train the models - which reflect functional relatedness among tissues - explain 
similarities in the predictive power of the motif Thus, 15% of genes that are highly expressed in liver are also highly expressed in lung, while 
less than 5% are in caudate nucleus and thymus. 



related tissues (Figure SI in Additional file 1), confirm- 
ing that the models rely on tissue-specific motifs. We 
even obtained high AUC values for models in which, 
at first glance, we could not detect any significantly 
enriched motifs, such as for BM-CD71-I- early erythroid 
cells. This result suggests the existence of different subsets 
of promoters, with characteristic sequence features. 
Modest performance is likely explained by lack of se- 
quence features and/or relatively high heterogeneity of 
the promoters in the training set of the model. Thus, 
our models performed well even in the presence of a 
relatively large fraction of promoters overlapping CpG 
islands, but yielded higher AUC values when trained on 



CpG-poor promoters (with the mean fraction of pro- 
moters overlapping CpG islands being 0.58 for reliable 
models, as compared with 0.67 for unreliable models; 
Figure S2 in Additional file 1; Pearson's r^ = 0.1 with 
P-value = 0.001). Since genes expressed in the brain 
are strongly associated with CpG islands [31,32], many 
of the models yielding low AUC values involved brain 
tissues. The performance of the models is also negatively 
correlated with the fraction of promoters enriched in 
TATA-box motifs (with the mean fraction of promoters 
containing TATA boxes being 0.49 for reliable models, as 
compared with 0.57 for unreliable models; Figure S3 in 
Additional file 1; r^ = 0.4, P-value = 2.8 x 10'"). Additionally, 
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promoters of most highly expressed genes in rehable 
models are less conserved at the TSS compared to those 
in poor models (with average percentage of sequence 
identity between human and mouse of 0.63 for reliable 
models, as compared with 0.70 for unreliable models; 
Figure S4 in Additional file 1; r^ = 0.4, /'-value = 4.4 x 10'^°). 
The genes regulated by these promoters exhibit similar 
conservation trends. This result suggests that extensive 
use of promoters with tissue-specific activity could have 
arisen as a means to facilitate the acquisition of novel 
gene functions. 

We next observed that many of the most highly pre- 
dictive motifs for tissue-specific gene expression (that is, 
those with the largest positive weights; see Materials and 
methods) for reliable models are known to be involved 
in the regulation of the corresponding tissue. For in- 
stance, motifs with the highest predictive power (among 
the top 2%) for the liver model included binding sites for 
HNF4A, PPARA, NR1H3, and NR2F2 (Additional file 2), 
which are among TFs that have been experimentally 
shown to control hepatic function and development 
[33-36]. This analysis is limited in that TFs may recognize 
similar binding sites and in that motif databases are par- 
tially redundant. Thus, establishing the identity of the TF 
that may be binding to particular motifs is not trivial. 
However, taken as a whole, these observations suggest that 
our model specifically captures key regulators of the liver 
transcriptional network (Figure IB). Also, as expected, 
binding sites for the same TFs characterize models for 
tissues with similar gene expression profiles. For example, 
most brain tissues share binding sites for members of the 
Hox and Pax families of TFs (Additional file 3), confirm- 
ing the correlation between motifs with high predictive 
power and tissue-specific regulation. 

In summary, the strong predictive value of the motifs 
identified in promoter regions confirms that they are 
highly associated with tissue-specific gene expression, and 
substantiates the involvement of promoters in the regula- 
tion of tissue-specific expression. 

Promoter signatures identify tissue-specific enhancers 

We next assessed whether the models describing tissue- 
specific promoter activity could be exploited to discover 
enhancers. For this purpose, we applied each of the 73 
reliable models trained on promoter regions to predict 
enhancers in the loci of genes that were among the 200 
most highly or lowly expressed in the corresponding 
tissue (see Materials and methods). We evaluated only 
evolutionarily conserved non-coding sequences across 
the human and mouse genomes located at least 2.5 kb 
upstream and 0.5 kb downstream of the nearest TSS 
(see Materials and methods). Recent studies suggest 
that only about 50% are conserved in mammals, the re- 
mainder constituting lineage-specific elements (for example, 



[37-41]). This fraction, however, is expected to depend 
on the particular tissue where the enhancers are active. 
On the other hand, only 10% of the genomic sequence is 
conserved between mammals. This makes conservation 
an effective filter for enhancer identification. Indeed, inte- 
grating sequence analysis with comparative genomics has 
been shown to reveal important subsets of enhancers (for 
example, [17,18,42,43]). While restricting the analysis to 
conserved sequences implies a reduction in sensitivity, 
we considered this filter essential to increase the specifi- 
city of our approach. While tissue-specific enhancers 
often regulate gene expression over longer distances, 
they tend to be enriched near genes that are expressed 
and functional in the tissue of interest [40,44,45]. Hence, 
differences in the enrichment of candidate tissue-specific 
enhancers between the loci of genes most highly and lowly 
expressed in the corresponding tissue could be used as an 
indicator of whether the predicted enhancers do indeed 
drive tissue-specific expression. 

In 78% of tissues, our enhancer predictions are enriched 
in the loci of the 200 most highly expressed genes as com- 
pared to lowly expressed genes (P-values <0.05, Fisher's 
exact test; Figure 2A; see Materials and methods for 
details). The most pronounced enrichment in physiolo- 
gically normal tissue was observed in heart, lung, and 
liver, with fold differences of at least 4.5. Moreover, for 
44% of the tissues, we also found predictions in a sig- 
nificantly larger fraction of loci of most highly expressed 
genes as compared to loci of lowly expressed genes. For 
instance, we observed candidate liver enhancers in the 
60% of the loci of most highly expressed genes, but only 
in 43% of the loci of lowly expressed genes (P- value = 0.01, 
computed with Fisher's exact test; Figure 85 in Additional 
file 1). Finally, for 26% of the tissues the scores of the 
candidate enhancers were significantly greater in the loci 
of most highly expressed genes as compared with those 
in the loci of lowly expressed genes (P-value <0.05, 
Wilcoxon rank-sum test; Figure 86 in Additional file 1), 
suggesting that increasing the stringency of the prediction 
threshold would result in even stronger associations. In 
total, enhancer predictions in the loci of most highly 
expressed genes differed significantly from their counter- 
parts in the loci of lowly expressed genes for 85% of the 
tissues examined according to at least one of the above- 
mentioned criteria, with 14% (10) of the tissues exhibiting 
significant differences according to all of them (Table 1; 
Additional file 3). 

Fold enrichment between the proportions of enhancer 
predictions in the loci of the 200 most highly and lowly 
expressed genes is strongly correlated with the accuracy 
of the promoter models (Figure 2B). Promoter models 
that performed only modestly, such as those based on 
brain tissues (AUG <0.70), had limited success in predict- 
ing enhancers locus-wide (with fold enrichments reaching 
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Figure 2 Genome-wide enhancer predictions. (A) The number of enhancer predictions in the loci of highly expressed genes divided by the 
total number of sequences scanned in the loci of highly expressed genes (in red), as compared to the number of enhancer predictions in the loci 
of lowly expressed genes divided by the total number of sequences scanned in the loci of lowly expressed genes (in black), for 71 promoter-based 
models. Statistically significant differences are indicated by asterisks (P-values <0.05, Fisher's exact test). (B) Correlation between the fold enrichment 
between the proportions of enhancer predictions in the loci of highly expressed and lowly expressed genes with the cross-validation accuracy of the 
corresponding promoter-based models. (C) Overlap of liver enhancer predictions with strong enhancers predicted by ChromHMM in HepG2 cell lines 
[51], compared to random sequences with similar length in the loci of genes highly expressed in liver. (D) Overlap of liver enhancer predictions with 
DNase I hypersensitivity sites (DHS) in HepG2 cell lines from the ENCODE project, compared to random sequences with similar length in the loci of 
genes highly expressed in liver. 



at most 1.3), while well-performing promoter models 
(AUC >0.90), such as those for heart, liver, kidney, and 
lung, achieved greater fold enrichments of at least 2.8 (for 
kidney). We also observed slightly higher fold enrichments 
between the proportions of enhancer predictions in the 



loci of most highly and lowly expressed genes when the 
difference in GC content between the former and the 
latter was relatively large (log2 ratio of 0.13 as compared 
to -0.01 for lower fold enrichments between the propor- 
tions of enhancer predictions in the loci of highly and 



Taher ef al. Genome Biology 201 3, 14:R1 1 7 
http://genomebiology.com/201 3/14/1 0/R1 1 7 



Page 6 of 18 



Table 1 Tissues for which the promoter models produce the most robust sets of predictions for enhancers 

Tissue Number of enhancer predictions Fraction of loci with enhancer predictions Prediction scores AUC 
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Only promoter-based models yielding AUC greater than 0.6 were considered in this analysis. The performance of each model in predicting enhancers was 
assessed by the significance of the difference between the relative number of enhancer predictions in the loci of highly expressed and lowly expressed genes with 
respect to the total number of scanned sequences, the significance of the difference between the fraction of loci of highly and lowly expressed genes comprising 
enhancer predictions, and the significance of the difference between the scores of enhancer predictions in predictions in the loci of highly and lowly expressed 
genes (see Materials and methods). P-values were computed using Fisher's exact test and Wilcoxon rank-sum test. 



lowly expressed genes, P-value = 3.2 x 10'^^, Wilcoxon 
rank-sum test), suggesting a role for the GC content in 
the control of tissue-specific expression. 

In general, enhancers predicted in the loci of the 200 
most highly expressed genes in a given tissue were found 
to overlap extensively with experimental and computa- 
tional enhancer marks characteristic of functional regu- 
latory regions. For instance, candidate tissue-specific 
enhancers were found to be significantly enriched 
(P-value <0.05, Fisher's exact test) in binding sites for 
TFs within regulatory networks that are known to be 
important in the respective tissues, such as MYC and 
NFKBl in heart, and HNF4A and SPl in liver [46-49] 
(data not shown). The combined collection of enhancers 
predicted in the loci of the 200 most highly expressed 
genes in each of the tissues considered significantly 
overlap with ORegAnno, a manually curated collection 
of regulatory sequences [50], featuring a two-fold en- 
richment (P-value <0.001, computed based on 1,000 
randomized sequences genome-wide). Also, our enhancer 
predictions are enriched for specific epigenetic histone 
marks generally associated with distal transcriptional 
regulation, as suggested by 41% of predicted enhancers 
overlapping ChromHMM predictions for strong and weak 
enhancers (1.5-fold enrichment, P-value <0.001, computed 
based on 1,000 randomized sequences genome-wide [51]). 
In addition, our predictions are significantly associated 
with the enhancer chromatin signature H3K4mel (1.3- 
fold enrichment, P-value <0.001, computed based on 
1,000 randomized sequences genome-wide) and DNase I 
hypersensitive sites (DHSs) in different human cell lines, 
with a total of 42% of predicted enhancers overlapping 
1% of the DHSs (1.6-fold enrichment, P-value <0.001, 



computed based on 1,000 randomized sequences genome- 
wide). In particular, liver enhancer predictions extensively 
overlap with different enhancer marks, such as p300 bind- 
ing, chromatin marks, and DHSs (Figure 57 in Additional 
file 1). For example, 29% of liver enhancer predictions 
overlap chromatin marks and ChromHMM enhancer pre- 
dictions for the HepG2 hepatocellular carcinoma cell line, 
providing additional evidence for the tissue-specificity 
of the activity of the predicted enhancers (Figures 2C,D; 
Figure S8 in Additional file 1). Substantial overlap is 
also observed for other classifiers with DHSs (Figure S9 
in Additional file 1). Finally, we found that enhancer 
predictions are significantly enriched in matching p300 
embryonic brain, limb, and heart enhancers (2.5-fold 
enrichment, P-value <0.001, computed based on 1,000 
randomized sequences genome-wide, [45,52]). 

Taken together, these observations are consistent with 
our promoter-based models being able to predict enhancers 
that drive specific expression of neighboring genes in 
different tissues. 

Experimental assays validate tissue-specific activity of 
promoter-based enhancer predictions 

The most reliable evidence for the accuracy of our 
promoter-based models in predicting tissue-specific en- 
hancers is the experimental verification of their regula- 
tory activity in vivo. Substantiated by the consistent 
results from the computational analysis, we chose to 
validate a subset of liver enhancer predictions in the loci 
of highly expressed liver genes using a mouse liver re- 
porter assay [53,54]. We selected, as described in detail 
below, 12 out of the total of approximately 400 regions 
with predicted liver enhancer activity (Table 2) and 5 



Taher ef al. Genome Biology 201 3, 14:R1 1 7 
http://genomebiology.com/201 3/14/1 0/R1 1 7 



Page 7 of 18 



Table 2 In vivo assay of 1 2 liver enhancer predictions in mouse 



ID 


Coordinates [hg18] 


Score 


Location 


Activity 


Chromatin state° 


El 


Chrl 6:300091 97-30009301 


3.07 


Intronic {TBX6) 


Yes 


No 


E7 


Chrl 0:82023332-82023434 


2.60 


Intergenic (3' UTR of MAT! A) 


No 


Yes 


E2 


Chrl 7:69962796-69962894 


2.21 


Intergenic (4.5 kb downstream of GPRC5Q 


No 


No 


E8 


Chrl :3 1 679030-31 6791 49 


1.94 


Intronic (SERINC2) 


Yes 


Yes 


E12 


Chr3:1 3493491 1-1 34935091 


1.81 


Intronic (TF) 


Yes 


Yes 


E4 


Chrl 1 :721 38832-721 391 1 9 


1.83 


Intronic (ARAP1) 


Yes 


No 


E5 


Chrl 7:69957921-69958023 


1.30 


Intergenic (3' UTR of GPRC5Q 


Yes 


No 


E10 


Chrl 7:69951 076-69951 329 


1.57 


Intronic (GPRC5Q 


Yes 


Yes 


E3 


Chrl 1:72162942-72163179 


1.47 


Intronic {STARDIO) 


No 


No 


E9 


Chrl 1:72168912-72169046 


1.33 


Intronic (STARDIO) 


No 


Yes 


Ell 


Chrl 1 :721 66225-721 66509 


1.36 


Intronic (STARDIO) 


Yes 


Yes 


E6 


Chrl 7:1 7439720-1 743991 3 


1.04 


Intergenic (4 kb upstream of PEMJ) 


No 


No 



^Overlaps with 'strong enhancer' Chromatin State Segmentation by HMM from Broad Institute, MIT, and MGH in HepG2 cell lines. Enhancer predictions for which 
we observed in vivo activity in mouse liver are highlighted in bold. 



regions with no predicted activity as controls (Table 3) 
for functional testing. Importantly, we tried to ensure 
that the enhancer predictions tested were not signifi- 
cantly different from the whole set of predictions, and 
chose controls exclusively based on their score. Thus, 
differences between enhancer predictions and controls 
observed for other sequence properties simply reflect 
an association between high scores and the existence 
of functional constraints, rather than bias in the selec- 
tion of the sequences. Liver enhancer predictions se- 
lected for validation had an average score of 1.79, and 
were distributed across the complete range of scores 
(Figure SlOA in Additional file 1). Additionally, liver 
enhancer predictions selected for validation are located at 
an average distance to the nearest TSS of 7.3 kb, and are 
not significantly different from the entire set of liver en- 
hancer predictions (Figure SlOB in Additional file 1). 
Also, liver predictions selected for validation did not 
exhibit statistically significant differences in the level of 
evolutionary constraint compared to the entire set of liver 
predictions, with an average phastCons score [55,56] of 
0.38 (Figure SIOC in Additional file 1). Finally, while 
half of the regions with predicted liver enhancer activity 



were selected randomly, the remaining half was selected 
randomly among those predictions overlapping with 
strong enhancer predictions by ChromHMM in HepG2 
cell lines [51]. Controls were selected randomly among 
sequences that had scores in the bottom half of the 
score distribution for the full set of scanned sequences, 
had an average score of -1.70, were located 27.5 kb away 
from the nearest TSS, and had an average phastCons 
score of 0.37. Each liver enhancer prediction and control 
was cloned upstream of a minimal promoter element and 
the luciferase reporter gene (pGL4.23; Promega). Each 
construct was then injected using the hydrodynamic tail 
vein injection assay into at least three different mice, 
and liver enhancer activity was assayed after 24 h by 
measuring luciferase levels (see Materials and methods). 

We observed statistically significant enhancer activity 
for 7/12 (58%) enhancer predictions compared to empty- 
vector-injected mice, with no significant difference de- 
pending on how predictions were selected (two-tailed 
Fisher's exact test). The significant increase in luciferase 
activity driven by liver enhancer predictions ranged from 
2.0- to 6.4-fold relative to the empty vector. By compari- 
son, 0/5 of the controls activated the luciferase reporter 



Table 3 In vivo assay of regions with no predicted regulatory activity (controls) 



ID Coordinates [hg18] Score Location Activity 



CI 


Chrl 5:56263227-56263340 


-2.22 


Intronic IAQP9) 


No 


C2 


Chr6:26030369-26030485 


-2.03 


Intronic (SLC17A2) 


No 


C3 


Chr22:1 9494975-19495083 


-1.26 


Intronic (PI4KA) 


No 


C4 


Chr5:1 38482622-1 38482789 


-1.57 


Intronic (SlU) 


No 


C5 


Chr9:964B5498-96485627 


-1.32 


Intergenic (50 kbp upstream of FBPl and C9orf3) 


No 
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(false discovery rate adjusted q < 0.05; Figure 3, Tables 1 
and 2; Additional file 4). These data confirm that a large 
fraction of our liver enhancer predictions function as 
enhancers in vivo, regulating expression in the liver. 

Promoter-based models have the potential of shedding 
light on the human regulatory landscape 

We applied each of the 73 reliable models trained on 
promoter regions to the entire sequence of the human 
genome. We scanned approximately 1,200,000 non- 
promoter CNEs across the human and mouse genomes 
for enhancer signatures (see Materials and methods). 
No model generated more than 160,000 enhancer pre- 
dictions, with an average of approximately 51,000. We 
observed substantial overlap among enhancer predictions 
for related tissues (Figure Sll in Additional file 1), in 
part reflecting the resemblance between promoter-based 
models of tissues with similar gene expression profiles, 
but also indicating the existence of shared regulatory path- 
ways. Thus, from all sequences scanned, approximately 
900,000 (73%) were considered enhancer predictions for 
at least one of the models, with an average of approxi- 
mately 12,000 non-redundant enhancer predictions per 
tissue, consistent with current ChlP-seq findings [29]. 
Although we estimate the false positive rate at approxi- 
mately 5% based on the number of enhancer predictions 
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Figure 3 Experimental validation of liver enhancer predictions 
using the hydrodynamic tail vein enhancer assay. On each 
injection day, we also injected an empty pGL4.23[luc2] vector and a 
known liver enliancer of tlie ApoE gene as negative and positive 
controls, respectively. At least three mice were injected per construct. 
Statistical significance was tested using Student's f-test followed by 
multiple testing adjustment with Benjamini-Hochberg's method. The 
asterisks indicate statistical significance to control at adjusted 
P-value <0.05. 



in the loci of lowly expressed genes, a caveat of our ap- 
proach is that local differences in the composition of 
the human genome could result in overall higher false 
positive rates. Also, consistent with the literature (for 
example, [57]), we found that most loci in the genome 
contain more than one enhancer. Indeed, without con- 
sidering redundancy among predictions, we predict an 
average of four enhancers per locus per model, with the 
exact number depending on the tissue (Figure S12 in 
Additional file 1). 

We then analyzed the distribution of enhancer predic- 
tions across the genome relative to genes. From all se- 
quences that were classified as enhancer predictions by 
at least one of the models, 55% mapped within intronic 
regions, 43% mapped within intergenic regions, and the 
remaining 2% to UTRs. The trend is consistent for all 
tissues, in that the proportion of intronic enhancer 
predictions is always greater than that of intergenic 
predictions. Overall, tissue-specific enhancer predictions 
tend to be located closer to TSSs, and in particular, 
near TSSs of highly expressed genes in matching tis- 
sues. For example, there was more than 3-fold enrich- 
ment in liver enhancer predictions within 100 kb of the 
TSS of the 200 most highly expressed genes in the liver 
(P-value <0.001, computed based on 1,000 randomized 
sequences genome-wide), a number that increased to 
4- fold enrichment within 10 kb of the TSS (P-value <0.001, 
computed based on 1,000 randomized sequences genome- 
wide). Furthermore, stronger enhancer predictions are 
closer to TSSs than weaker predictions, with, for in- 
stance, the strongest 1% of liver enhancer predictions 
being located 40 kb away from the nearest TSS as com- 
pared to 73 kb for the complete set of liver enhancer 
predictions. These results are in agreement with the 
literature, and suggest that the functional relevance of 
a genomic region depends on its position relative to 
the TSS [58]. Our enhancer predictions are enriched 
near genes annotated with relevant gene ontology terms. 
For example, we found more than five-fold enrichment 
in liver enhancer predictions within the loci of genes 
associated with 'positive regulation of hepatic stellate 
cell activation', 'liver development', and 'positive regula- 
tion of hepatocyte differentiation' (P-values <0.05, Fisher's 
exact test), as well as enrichment for genes with critical 
liver functions, such as 'positive regulation of cholesterol 
metabolic process' (P-value = 2.7 x 10'^*, Fisher's exact 
test), 'triglyceride lipase activity' (P-value = 6.0 x 10' , 
Fisher's exact test), and sucrose, maltose, and trehalose 
metabolic processes (all P-values <0.05, Fisher's exact test). 

Although all our tissue-specific enhancer predictions 
were selected from conserved non-coding sequences across 
the human and mouse genomes, they exhibit different 
levels of conservation according to their phastCons scores 
(Figure S13 in Additional fOe 1). For example, liver and 
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heart enhancer predictions in the loci of highly expressed 
genes are significantly more conserved than the sequences 
used as basis for making predictions (0.41 versus 0.34, and 
0.43 versus 0.37, with P-values 1.1 x 10'^^ and 5.6 x 10'^^, 
respectively, calculated using the WUcoxon rank-sum 
test). For models that did not perform well in terms of 
their fold enrichment between the proportion of enhancer 
predictions in the loci of highly and lowly expressed genes 
(for example, skin and fetal brain), we observed signifi- 
cantly less constrained predictions. We observed similar 
trends when we applied our promoter-based classifiers 
to investigate unconstrained sequences (see Supplementary 
notes in Additional file 1). 

In summary, our results indicate the existence of largely 
disjoint sets of tissue-specific regulatory sequences located 
in the neighborhood of their potential target genes. They 
also confirm an important role for evolutionarily con- 
strained sequences, in that 73% of sequences conserved 
across mammals exhibit regulatory potential. Finally, 
consistent with previous studies, they support a role for 
both promoters and enhancers in determining spatio- 
temporal patterns of gene expression. 

Conclusions 

By analyzing the sequence of promoters of tissue-specific 
genes, we confirmed that tissue-specific promoters and 
enhancers share TF binding motifs within the loci of their 
cognate genes. Moreover, we observed that regulatory in- 
formation in the promoters of tissue-specific genes is pre- 
dictive of the enhancers targeting these genes. For 73/79 
tissues, we could reliably distinguish between highly and 
lowly expressed genes based exclusively on the presence 
or absence of putative motifs (AUC >60%). Although 
similar cut-offs have been recently employed (for ex- 
ample, [18,59]), we recognize that the half of the models 
exhibiting modest performances (AUC <80%) might have 
limited predictive value. It is, however, important to note 
that the reported AUCs represent the lower bound of 
the classifier accuracy due to the fact that the strength 
of the tissue-specificity enhancer signal is expected to 
vary among the promoters of tissue-specific genes. Pro- 
moters containing only weak signals will inherently de- 
flate the classification AUC estimates. To further address 
the performance of the classifiers at predicting tissue- 
specific enhancers, we introduced a panel of independent 
computational and experimental tests, which ultimately 
validated our analysis. Many of the TFs binding to the 
motifs that are identified as relevant to each of these 
models are known to play a fundamental role in the 
development or maintenance of normal function of 
the corresponding tissues. We showed that the motifs 
found in promoter regions can be used to predict en- 
hancers with matching tissue-specificity. The accuracy of 
our tissue-specific enhancer predictions by promoter- 



based models is supported by a highly significant associ- 
ation of enhancer predictions with the genes most 
highly expressed in a given tissue, and by a significant 
overlap of predictions with experimentally identified 
tissue-specific enhancers. 

More importantly, 58% (7/12) of liver enhancer pre- 
dictions generated by the promoter-based model drove 
luciferase expression in the liver following hydro- 
dynamic tail vein injection in mice, whereas none of the 
five negative controls did. 

Six of the seven validated liver enhancers were located 
within introns (for the genes TBX6, SERINC2, TF, ARAPl, 
STARDIO, and GPRCSC), while the remaining prediction 
was in the immediate vicinity of GPRCSC. These genes 
have been previously reported as moderately to highly 
expressed in liver and gallbladder [60]. For example, 
although the specific function of GPRCSC is unknown, 
the gene is highly expressed in the liver and has been 
suggested to play a role in signaling events when induced 
by retinoic acid [61]. In addition to other motifs, liver 
enhancer predictions that exhibited luciferase activity 
contained predicted binding sites for 5 to 11 out of 27 
known liver TFs (Additional file 5). Although each of the 
critical liver TFs PPARA, PPARG, NR2F2, and HNF4A 
had binding sites in 6/7 (86%) sequences, no single TF 
had a binding site in all 7 sequences that exhibited lucif- 
erase activity, highlighting the ability of our method to 
model flexible regulatory sequence encryptions. In turn, 
each of the 7 sequences included binding sites for at 
least 2 of these 4 TFs, and 4/7 (57%) for all 4. The 5 liver 
enhancer predictions that exhibited no significant lucif- 
erase activity contained binding sites for 4 to 8 out of 27 
known liver TFs. HNF4A had binding sites in all 5 se- 
quences, PPARA and NR2F2 had binding sites in 4/5 
(80%) sequences, and PPARG in 3/5 (60%). With one 
exception, each of the sequences included binding sites 
for at least 2 of these 4 TFs, and 3/5 (60%) for all 4. For 
comparison, 87% of all sequences scored by the liver 
promoter-based model contained 1 to 18 out of 27 known 
liver TFs. HNF4A had binding sites in 28% of the se- 
quences. Only 11% of the sequences had binding sites 
for at least two TFs among PPARA, PPARG, NR2F2, 
and HNF4A, and 7% for all four TFs. Despite limitations 
in the accuracy of the TF binding site predictions and 
despite the fact that many motifs may be nonfunctional 
[62], our results suggest that particular combinations of 
TFs, rather than single TFs, are necessary to establish 
liver transcription. In addition, the function of assayed 
sequences may be subject to activation or inhibition by 
additional cis- and traw-regulatory elements. For example, 
enhancer activity might be induced by hormones or drugs 
under particular conditions [63] or depend on neighbor- 
ing functional elements that are absent in the construct 
used for the experiment. This and other phenomena 
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could produce false-negative observations in the reporter 
assays. In any case, the experimental data presented here 
provide independent and robust validation of the enhan- 
cer predictions obtained with the promoter-based models, 
and lend further support for the hypothesis that the speci- 
ficity of interactions between enhancers and promoters 
is at least partly due to the binding of tissue-specific TFs. 

Our models predict multiple tissue-specific enhancers 
per locus and per tissue, as well as multiple tissues or do- 
mains of activity for most enhancers. This redundancy, 
which has long been reported (for example, [64-66]), may 
serve to increase the robustness of the regulatory network 
[67]. Furthermore, it is likely that apparently redundant 
enhancers activate gene expression in different cell types 
and/or under different developmental stages or conditions 
[68-71]. The genomic distribution of enhancers is also 
likely to vary depending on the function of their target 
genes. For example, the loci of transcription factors and 
developmental genes are known to contain particularly 
high densities of CNEs, many of which act as distal en- 
hancers [72-77]. More recently, advances in technical 
approaches, such as chromosome conformation capture 
and its derivatives, have confirmed these findings inde- 
pendent of sequence conservation [24,78] . We observed 
that relatively long loci, such as those of genes expressed 
in brain tissues, featured more enhancer predictions per 
locus compared to short loci. However, some compact 
loci, such as those of genes highly expressed in liver, lung, 
and heart, contained a relatively large number of enhancer 
predictions, providing evidence for a particular need for 
fine-tuning the expression level in these tissues. Further- 
more, the level of conservation of enhancer sequences is 
likely to depend, as other studies suggests (for example, 
[79,80]), on their particularly activity, although we found 
that, for all models, a large proportion of the enhancer 
predictions is likely to be conserved across mammals. 

Finally, our results add further evidence for a signifi- 
cant role of both promoters and enhancers in determin- 
ing tissue specificity. This role is supported by several 
examples from the literature [14,81,82]. Different enhancer- 
promoter preferences would provide an additional level 
of transcriptional control, assisting in establishing the 
favorable interactions, for instance, between enhancers 
and their cognate promoters when they are distant, or 
between enhancers and their cognate promoters within 
a gene cluster. The intimate coordination of promoters 
and enhancers in regulating tissue-specific transcription 
has immediate practical consequences. It makes it pos- 
sible to describe the complex regulatory landscape of 
higher eukaryotes, and eventually identify regulatory 
elements located hundreds of kilobases away from their 
target gene, based solely on the analysis of proximal regula- 
tory elements. DNA microarrays and, more recently, RNA- 
seq are currently being used to profile the transcriptomes 



of a diverse range of cell/tissue types, conditions, and spe- 
cies. As more expression data become available, particu- 
larly in the context of large projects such as ENCODE 
[83] and the 1000 Genome Project [84], it is our belief that 
the application of approaches such as the one we are 
proposing here will result in important new insights and 
improve our understanding of transcriptional regulation. 
Such projects are also generating a wealth of epigenetics 
information that can be easily integrated with our models 
to reveal genomic signatures controlling transcription. 

Materials and methods 

Gene annotation and expression data 

GNF Novartis Gene Expression Atlas version 2 [30] was 
extracted from the gnfAtlas2 table and mapped to the 
RefSeq [85] genes using the knownToGnfAtlas2 and kgXref 
tables (all tables are available in the UCSC Genome 
Browser database [86]). Thereby, we obtained expres- 
sion profiles in 79 tissues (721 B lymphoblasts, BM- 
CD105+ endothelial, BM-CD33+ myeloid, BM-CD34+, 
BM-CD71+ early erythroid, PB-BDCA4+ dentritic cells, 
PB-CD14+ monocytes, PB-CD19+ B cells, PB-CD4+ T 
cells, PB-CD56+ natural killer (NK) cells, PB-CD8+ T 
cells, adipocyte, adrenal cortex, adrenal gland, amygdala, 
appendix, atrioventricular node, bone marrow, bronchial 
epithelial cells, cardiac myocytes, caudate nucleus, cere- 
bellum, cerebellum peduncles, ciliary ganglion, cingulate 
cortex, colorectal adenocarcinoma, dorsal root ganglion, 
fetal brain, fetal liver, fetal lung, fetal thyroid, globus 
pallidus, heart, hypothalamus, kidney, leukemia chronic 
myelogenous (k562), leukemia lymphoblastic (molt4), 
leukemia promyelocytic (hl60), liver, lung, lymph node, 
lymphoma Burkitts Daudi, lymphoma Burkitts Raji, 
medulla oblongata, occipital lobe, olfactory bulb, ovary, 
pancreas, pancreatic islets, parietal lobe, pituitary gland, 
placenta, pons, prefrontal cortex, prostate, salivary gland, 
skeletal muscle, skin, smooth muscle, spinal cord, subtha- 
lamic nucleus, superior cervical ganglion, temporal lobe, 
testis Leydig cell, testis, testis germ cell, testis interstitial, 
testis seminiferous tubule, thalamus, thymus, thyroid, 
tongue, tonsil, trachea, trigeminal ganglion, uterus, uterus 
corpus, whole blood, whole brain) for 13,977 human 
genes. Overall, 5,023 genes were considered 'most highly 
expressed' in at least one of the 79 tissues. Additionally, 
6,531 genes were least expressed in these tissues. 

Locus definition 

In order to define gene loci, we first clustered together 
all overlapping transcripts in the refGene.txt and known- 
Gene.txt tables (available in the UCSC Genome Browser 
database [86]), and then assigned the closest half of the 
intergenic sequence separating two genes to each of the 
corresponding gene loci. Although the genes that are 
closest to the enhancers are reasonable target genes, there 
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are many known cases of enhancers located in introns 
of genes that are not their targets, as well as enhancers 
several kilobases away from their targets, with unrelated 
genes in between. Current integrative approaches result 
only in modest improvement in enhancer-target gene 
associations (for example, [87]), often requiring non- 
available data. Recently, a method based on Hi-C has 
been introduced to identify genome-wide functional do- 
mains based on higher-order chromatin interactions [5]. 
However, comparisons between alternative methods are 
limited because of the lack of an appropriate reference 
or gold standard. 

Promoter annotation and definition for promoter modeling 

Promoter regions were defined as encompassing a 3 kb 
region (2.5 kb upstream and 0.5 kb downstream of the TSS), 
relative to 5' TSSs of all transcripts annotated in RefSeq 
[85]. Although the total length is arbitrary, it intends to span 
both the core and proximal promoter regions. In most 
cases, the signal that turned out to be relevant for the 
models was detected within 500 bp of the TSS (Figure S14 
in Additional file 1). 

Gene expression values for each of the promoters of 
the most highly and least expressed genes in each of the 
79 tissues considered were extracted from [88]. Probe IDs 
were converted to UCSC Known Gene IDs using [89]. 
Subsequently, UCSC Known Gene IDs were converted 
to gene symbols and RefSeq IDs using [90]. Expression 
values for transcripts with the same gene symbol were 
averaged together. The 200 most highly and least expressed 
genes with different gene symbols were selected. TSSs of 
all RefSeq IDs associated with those gene symbols were 
then used to define 3 kb promoter regions. 

Sequence conservation of promoter regions 

Sequence identity of promoter regions was determined 
based on genome-genome alignment of human and mouse 
(from the net/chain track at UCSC [86]), using the hgl8 
and mm9 genome assembly, respectively. 

Sequence conservation of coding regions 

As an indicator of coding conservation across species we 
used the proportion of orthologs of human genes found 
in other eukaryotic species (HomoloGene Build 64 [91]). 

Motif occurrences 

Presence or absence of putative motifs was determined 
scanning the sequence for 775 motifs in TRANSFAC [92] 
and JASPAR [93-95] using MAST [96] with default 
parameters. 

Motif over-representation in promoter regions 

Over-representation of 775 motifs representing TF bind- 
ing sites in TRANSFAC and JASPAR among promoter 



regions of the 200 most highly expressed genes in each 
of the 79 tissues considered was determined by comparing 
the promoter regions of the 200 most highly expressed 
genes to the promoters of the 200 least expressed genes 
in the corresponding tissue. The entire length of the 
promoter region (-2.5 kb to +0.5 kb with respect to the 
TSS) was searched for motif occurrences with MAST. 
The numbers of putative TF binding site occurrences 
in each set of promoters were compared using the 
Wilcoxon rank-sum test. 

Transcription factors associated with transcription factor 
binding sites 

TF annotation for position-weight matrices (PWMs) was 
obtained from TRANSFAC [92], JASPAR [93-95], and 
the Broad Institute (MSigDB [97]). 

CpG islands and TATA-box motifs 

Annotation for CpG islands was obtained from the 
'cpglslandExt' UCSC track of the hgl8 assembly of the 
human genome database [86]. Presence or absence of 
TATA-box motifs in promoter regions was determined by 
scanning the sequence for TATA-box motifs in TRANSFAC 
[92] using MAST [96] with default parameters. 

Separating promoters of most highly and lowly 
expressed genes 
Training data 

The promoter regions (-2.5 kb to +0.5 kb with respect to 
the TSS, based on RefSeq annotation [85]) of the 200 most 
highly expressed genes (positive set) were compared to 
the promoters of 200 genes with the lowest expression 
(negative set) in each of the 79 considered tissues. 

Sequence representation 

Next, we converted the DNA sequence of each promoter 
into a set of TF binding site feature vectors. We first 
identified all CNEs (at least 70% sequence identity between 
human and mouse [98]) within each promoter sequence. 
Next, we ran the program MAST [96] with default pa- 
rameters to identify motif occurrences in the CNEs match- 
ing 775 known TF binding sites from the TRANSFAC [92] 
and JASPAR [93-95] databases. With this information, 
each CNE was then transformed into a 775-dimensional 
TF binding site feature vector, where each feature corre- 
sponds to the number of the corresponding TF binding 
site occurrences in the sequence of the CNE. There 
were 2.4 feature vectors (one per CNE) in a promoter, 
on average. 

For a given classifier, the training set contained as many 
feature vectors as the number of CNEs found in the pro- 
moters of the 200 most highly expressed genes (positive 
set) plus the number of CNEs found in the promoters of 
the 200 genes with the lowest expression (negative set). 
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Because promoter regions may overlap, the sets included 
only unique CNEs. 

Classifiers 

Linear SVMs [99] were used to find features relevant to 
distinguish between the CNEs in promoters associated 
with highly expressed genes (positive class) and those in 
promoters associated with lowly expressed genes (negative 
class). For each tissue, we trained a SVM on an average 
of 553 feature vectors representing CNEs in the promoter 
regions of highly expressed genes and 525 feature vectors 
representing CNEs in the promoter regions of lowly 
expressed genes. We optimized the weight of the positive 
class Wi by performing a grid search. The optimal value 
was chosen from Wi = ^Y, where }'€{|,|, is 
the number of signal sequences, and n is the number 
of control sequences. 

A double-loop cross-validation was used to assess the 
accuracy of the classifier. In each fold of the cross- 
validation, we used four-fifths of the members of the 
positive and negative classes to identify a 'consistent' set 
within the positive class. This strategy is aimed at iden- 
tifying sequences that are consistent with each other, 
in an effort to reduce the natural heterogeneity of the 
promoter sets. More precisely, in each fold of the 
cross-validation, for each promoter P in the positive 
class in the four-fifths of the data that was used to 
identify a consistent set, we trained a model excluding 
all sequences associated with P. Subsequently, we used 
that model to score each of the sequences associated 
with P. Finally, among those sequences, we randomly 
selected two positive-scoring ones to represent P in a 
'consistent' positive set. After repeating this for all pro- 
moters in the positive class, we obtained a 'consistent' 
positive set. This consistent positive set was used to- 
gether with the remaining one-fifth of the members of 
the negative class to train a final classifier. The accuracy 
of this final classifier was evaluated using a standard 
five-fold cross-validation. The entire procedure was re- 
peated for each of the five cross-validation folds, and 
the cross-validation was repeated five times. AUC was 
used as criterion for optimality. This double-loop cross- 
validation has been successfully applied to the enhancer 
prediction problem in the past (for example, [17]). 

Figure S15 in Additional file 1 illustrates the variation 
of the size of the consistent positive set for the 79 tissues 
considered. In our cross-validation framework, the con- 
sistent positive set contained an average of 157 CNEs, 
representing 35% of the training data. However, the size 
of the consistent positive set depends on the particular 
tissue, ranging from 39 (10%) to 269 (41%) CNEs for 
PB-CD56+ NK cells and medulla oblongata, respectively. 
For PB-CD56+ NK cells the consistent positive set also 



contained the smallest fraction of CNEs, while the lar- 
gest fraction was obtained for uterus corpus (51%). 

Linear SVMs Training a linear SVM classifier is equiva- 
lent to solving the following constrained optimization 
problem [100]: 

Given the training samples T = {(xi,yj)|xieRP,yj€ 
{-1, l}}"^j, find the values of w, b and that minimize 

^w^w+C^^i 

^ i=i 

satisfying the constraints 

yi(wTxi + b)>l-^i Vi= l,...,n 

and 

0 Vi= l,...,n 

The decision function of the classifier for an unknown 
sample x is given by: 

f (x) = sign(w^x + b) 

The dual form of this problem can be described as 
follows: Given the training samples T = {(xi,yj)|xiGRP, 
yje{-l,l}}"^j^, find the values {ai}|Li ^^'^ maximize 

2^ n n 

i ^ i=i j=i 

satisfying the constraints 

0 < ai <C Vi = 1, ...,n 

and 

n 

^a.y. = 0. 

i=l 

Samples X; for which aj > 0 are called support vectors. 
The vector w can be computed in terms of a, as: 

n 
i=l 

and, therefore, contains the weighted features of the 
support vectors. 

SVM parameter selection Linear SVMs have only one 
parameter, C, which controls the trade-off between 
errors on the training data and margin maximization. 
We found that the performance of the Hb enhancer 
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classifier was relatively stable with respect to changes 
in C. We estimated C based on the training data as 



. Misclassifications are penalized differently 



depending on the class of sequences, proportionally to the 
total number of sequences in each class. 

Predictive power of the motifs 

After obtaining a linear SVM model, the weight vector 
w can be used to decide the relevance of each feature 
[101]. The larger |wj|, the more important role of feature 
/ in the decision function. On these grounds, we used 
the weights Wj to assess the predictive power of each 
motif. 

Scaled SVM weights 

To make motif weights comparable across different SVM 
classifiers, we scaled them preserving their sign accord- 
ing to: 



scaled wj 



1- 



, if Wj < 0 

1 

if Wj >0 



where 



min^Wj^jif min{vi/y} <0 
0, otherwise 



and 



max 



{wj},if maxjivy} > 0 
0, otherwise 



GC content of transcription factor binding sites 

Sequence motifs representing motifs are usually encoded 
as PWMs. A PWM is a matrix containing the relative 
frequency of each of the four possible nucleotides at 
each position of a motif, which are estimates of the cor- 
responding probabilities. 

To obtain the GC content of a motif, we calculated 
and averaged the probability of observing G or C at each 
position of the corresponding PWM. 

In order to assess the contribution of the GC content 
to the performance of the promoter-based enhancer 
models, we trained 5 models using the aforementioned 
strategy, each time replacing the original 775 PWMs 
by an equally large collection of PWMs, in which the 
nucleotide probabilities of each PWM have been ran- 
domly permuted. 



Difference in GC content between two loci 

Differences in GC content between loci of highly and 
lowly expressed genes were expressed as the natural loga- 
rithm of the ratio between the GC content of the loci of 
highly expressed genes and the GC content of the loci 
of lowly expressed genes. 

Enhancer predictions 

We applied our promoter-based models as genome-wide 
predictors of human enhancers to both conserved and 
non-conserved sequences. In particular, for a given tis- 
sue, when we refer to predictions in the loci of the (200) 
most highly and lowly expressed genes, we imply predic- 
tions in the loci of the 200 genes with highest and lowest 
expression levels whose promoters were used to train 
the corresponding classifier. 

Prediction of conserved enhancers 

First, we selected CNEs with at least 70% identity across 
the human and mouse genomes [98] located at least 
2.5 kb upstream and 0.5 kb downstream of TSSs anno- 
tated in refGene.txt and knownGene.txt tables (available 
in the UCSC Genome Browser database [86]). Thus, we 
scored approximately 1,200,000 CNEs across the human 
genome, with an average length of 249 bp. In particular, 
the loci of the 200 most highly expressed genes in any 
of the 73 tissues considered comprised, on average, 85 
CNEs, and comprised a total of 500,000 CNEs, while the 
loci of the 200 genes with lowest expression in any of 
the 73 tissues considered included an average of 108 
and a total of 750,000, respectively. 

Prediction of non-conserved enhancers 

Second, we scanned the genome using a sliding window 
approach. Windows overlapping the sequence 2.5 kb up- 
stream and 0.5 kb downstream of the nearest TSS ac- 
cording to the refGene.txt and knownGene.txt tables 
(available in the UCSC Genome Browser database [86]) 
were excluded from further analysis. For the size of the 
window, we chose the average length of the conserved 
region between human and mouse [98], namely 230 bps. 
The sliding window is shifted by 115 bps. A given se- 
quence was considered an enhancer prediction (or enhan- 
cer candidate) if its score was greater than s = min(0, d), 
where d is the lowest score of the top 5% sequences scored 
in the control loci. 

Computational evaluation of genome-wide enhancer 

predictions 

Functional analysis 

To assess whether these elements disproportionally occur 
near genes with particular functions, we obtained the 
Gene Ontology [102] (CVS version 1.2811, GOC Valid- 
ation Date March 28, 2012) annotations of the closest 
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neighboring UCSC known genes [103] for all non-coding 
elements, and assigned those annotations to each elem- 
ent. Gene-to-GO mapping was achieved by combining 
the UCSC refGene.txt and knownGene.txt tables and 
GOA [104] association table using UniProt IDs. P-values 
were corrected for multiple testing using Bonferroni's 
method [105]. 

Fold enrichment of enhancer predictions in the loci of the 
200 most highly expressed genes as compared to the loci of 
lowly expressed genes 

In order to account for differences in the length of the 
loci, we did not directly compare the number of enhancer 
predictions in the loci of the 200 most highly expressed 
genes in a given tissue with the number of enhancer pre- 
dictions in the loci of lowly expressed genes in that same 
tissue, but the numbers of enhancer predictions divided 
by the numbers of scanned sequences for loci of highly 
and lowly expressed genes. Therefore, the fold enrich- 
ments in Table 1 and Additional file 3 were computed as 
the ratio of two proportions: (i) the total number of 
enhancers predicted in the loci of the 200 most highly 
expressed genes divided by the total number of sequences 
scanned in the loci of highly expressed genes; and (ii) the 
total number of enhancers predicted in the loci of lowly 
expressed genes divided by the total number of sequences 
scanned in the loci of lowly expressed genes. For the 73 
tissues evaluated and focusing only on CNEs across the 
human and mouse genomes, these proportions averaged 
0.04 for loci of highly expressed genes, and 0.03 for loci of 
lowly expressed genes. In the case of whole-loci predic- 
tions, these proportions averaged 0.03 for loci of the 200 
most highly expressed genes, and 0.02 for loci of lowly 
expressed genes. 

Fraction of loci comprising enhancer predictions 

The fraction of loci comprising enhancer predictions 
was defined as the number of loci in which at least one 
of the scanned sequences was considered an enhancer 
prediction divided by the total number of loci to which 
we applied the classifier. Therefore, the fold enrichments 
in Table 1 and Additional file 3 were computed as the 
ratio of two ratios: (i) the total number of loci of highly 
expressed genes comprising at least one enhancer pre- 
diction each divided by the total number of loci of highly 
expressed genes comprising at least one scanned sequence 
each; and (ii) the total number of loci of lowly expressed 
genes comprising at least one enhancer prediction each 
divided by the total number of loci of lowly expressed 
genes comprising at least one scanned sequence each. 
Each of the latter ratios ranges between 0 (no loci com- 
prising enhancer predictions) and 1 (all loci comprising 
scanned sequences also comprise enhancer predictions). 
For the 73 tissues evaluated and focusing only on CNEs 



across the human and mouse genomes, 59% of the loci 
of highly expressed genes comprised at least one enhan- 
cer prediction, while 52% of the loci of lowly expressed 
genes did. 

Overlap between predictions and different enhancer marks 

Predictions resulting from the 73 reliable promoter-based 
classifiers were combined into a set of non-redundant pre- 
dictions and overlapped with different enhancer marks. 
Additionally, when specifically stated, we report overlaps 
with predictions for particular promoter-based classifiers - 
for example, the classifier trained on liver promoters. 

Overlap with p300 Genomic regions enriched for p300 
in mouse forebrain, midbrain, limb, and heart tissues were 
extracted from Additional files 3, 4 and 5 [45], and 
mapped to the human genome (hgl8) using LiftOver 
[106]. Genomic regions identified in forebrain, mid- 
brain, limb, and heart were combined into one dataset. 
Overlapping genomic regions were clustered together. 

Overlap with DNase I hypersensitivity sites DNase I 
hypersensitivity data ('narrow peaks') for 86 human cell 
lines from the ENCODE project [25,107] were downloaded 
from the UCSC browser [108,109], converted to the hgl8 
assembly using LiftOver [106], and combined into one 
dataset. Overlapping genomic regions were clustered 
together. This resulted in a total of 1,722,559 non- 
overlapping regions with an average length of 253 bp. 
We then computed the intersection between the set of 
non-redundant enhancer predictions identified by any of 
the 73 promoter-based models and this DNase I hypersen- 
sitivity data dataset. Liver enhancer predictions, in particu- 
lar, were also compared with DNase I hypersensitivity data 
in HepG2. Predictions for enhancers in other tissues were 
compared with DNase I hypersensitivity data in closely 
related ENCODE tissues and cell lines (Figure S9 in 
Additional file 1). 

Overlap with histone modification marks Histone mark 
data (H3K4mel, H3K27ac) for 11 human cell lines from 
the ENCODE project [25,107] were downloaded from 
the UCSC browser [108,110], converted to the hgl8 
assembly using LiftOver [106], and combined into one 
dataset. Overlapping genomic regions were clustered to- 
gether. This resulted in a total of 189,889 non-overlapping 
regions with an average length of 6,275 bp. We then com- 
puted the intersection between the set of non-redundant 
enhancer predictions identified by any of the 73 promoter- 
based models and this histone mark dataset. Liver enhancer 
predictions, in particular, were also compared with histone 
marks in HepG2. 
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Overlap with ChromHMM predictions Weak and strong 
enhancers identified in nine human cell lines (HSMM, 
GM12878, HUVEC, Hl-hESC, K562, HepG2, NHEK, 
HMEC, NHLF) using ChromHMM [51] were down- 
loaded from the UCSC browser [108,111], converted to 
the hgl8 assembly using LiftOver [106], and combined 
into one dataset. Overlapping genomic regions were 
clustered together. This resulted in a total of 399,500 
non-overlapping regions with an average length of 
1,504 bp. We then computed the intersection between 
the set of non-redundant enhancer predictions identified by 
any of the 73 promoter-based models and the ChromHMM 
dataset. Liver enhancer predictions, in particular, were 
also compared with ChromHMM enhancers in HepG2. 

Conservation analysis 

PhastCons conservation scores [56] were based on align- 
ment of 28 vertebrate species and an 18 species placen- 
tal mammal subset, respectively [55]. 

In vivo validation of liver enhancer predictions 

Sequences selected for in vivo validation were PCR- 
amplified using TopTaq (Qiagen, Hilden, Germany) from 
human genomic DNA (Roche, Basel, Switzerland), puri- 
fied using the QIAquick PCR purification kit (Qiagen) and 
cloned into the pENTR-dTOPO vector (Life Technologies, 
Carlsbad, CA, USA). Proper insertion and orientation was 
confirmed by colony PCR, after which positive clones were 
transferred into the pGL4.23[luc2] vector (Promega) using 
the Gateway system (Life Technologies). Sequence and 
orientation of the insert were re-verified by Sanger se- 
quencing, and approximately 200 pg of endotoxin-free 
plasmid DNA was isolated using the EndoFree Plasmid 
Midi prep (Qiagen). 

For the hydrodynamic tail vein assay, 10 pg of each 
assayed sequence in pGL4.23[luc2] was injected along with 
2 pg of pGL4.74[hRluc/TK] vector to correct for injection 
efficiency, into at least three CDl mice (Charles River 
Laboratories, Wilmington, MA, USA) using the TransIT 
EE hydrodynamic gene delivery system (Mirus Bio LCC, 
Madison, WI, USA) according to the manufacturer's 
protocol. Negative (empty pGL4.23[luc2]) and positive 
(ApoE liver enhancer [110,112]) controls (n = 3 to 5) 
were also injected at each injection date/experiment. 
After 24 hours, livers were harvested and homogenized 
in passive lysis buffer (Promega), followed by centrifuga- 
tion at 4°C for 30 minutes at 14,000 rpm. Firefly and 
Renilla luciferase activity in the supernatant (diluted 
1:20) were measured on a Synergy 2 microplate reader 
(BioTek Instruments, Winooski, VT, USA) in technical 
replicates of four for each liver, using the Dual-Luciferase 
reporter assay system (Promega). The ratios for firefly 
luciferase:Renilla luciferase were determined and expressed 



as relative luciferase activity. All mouse work was approved 
by the UCSF Institutional Animal Care and Use Committee. 

Additional files 



Additional file 1: Figures S1 to 15 and Supplementary notes. 

Additional file 2: Table SI. A table listing the motif ranks for the 
promoter-based models. 

Additional file 3: Table S2. A table summarizing the performance of 
the promoter-based models. 

Additional file 4: Table S3. A summary of results obtained with the 
hydrodynamic tail vein injection assay. 

Additional file 5: Table S4. A list of transcription factors known to be 
relevant for liver function. 
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